perf(db): bulk CFG and dataflow DB writes via rusqlite#653
Conversation
Move AST node SQLite inserts from per-row JS iteration to a single native Rust transaction via napi-rs + rusqlite. The new bulkInsertAstNodes function opens the DB directly from Rust, pre-fetches parent node definitions, and inserts all rows in one transaction — eliminating the JS-native FFI overhead per row. The JS-side buildAstNodes tries the native fast path first (when all files have native astNodes arrays), falling back to the existing JS loop for WASM or mixed-engine scenarios. Target: astMs < 50ms on native full builds (was ~393ms).
The Rust connection omitted the busy_timeout = 5000 that the JS-side connection.ts sets. Without it, SQLITE_BUSY is returned immediately on WAL contention instead of retrying for up to 5 seconds.
Please document why the bundled SQLite feature is intentional: Windows CI lacks a system SQLite, and WAL coordination between the two instances (Rust and JS) is handled by OS-level file locking, so running dual instances is safe.
The Rust find_parent_id skipped definitions with end_line = NULL, but the JS findParentDef treats them as always-enclosing with a negative span (preferred over wider defs). This caused parent_node_id mismatches between native and JS paths.
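The JS-side preference rule described above can be sketched roughly as follows — a hedged TypeScript illustration, assuming a minimal `Def` shape (the `startLine`/`endLine` field names and this standalone `findParentDef` signature are illustrative, not the actual source):

```typescript
// Illustrative sketch of the parent-resolution rule: a def with a NULL
// end_line is treated as always-enclosing, and its computed span comes
// out negative, so it is preferred over any wider, fully-bounded def.
interface Def {
  id: number;
  startLine: number;
  endLine: number | null; // NULL in the DB maps to null here
}

function findParentDef(defs: Def[], nodeLine: number): Def | undefined {
  let best: Def | undefined;
  let bestSpan = Infinity;
  for (const def of defs) {
    // null endLine => always-enclosing from its start line onward
    const encloses =
      def.endLine === null
        ? def.startLine <= nodeLine
        : def.startLine <= nodeLine && nodeLine <= def.endLine;
    if (!encloses) continue;
    // (null ?? 0) - startLine yields a negative span, which always
    // sorts before any real (non-negative) span — hence "preferred"
    const span = (def.endLine ?? 0) - def.startLine;
    if (span < bestSpan) {
      bestSpan = span;
      best = def;
    }
  }
  return best;
}
```

A Rust `find_parent_id` that simply skips `end_line = NULL` rows diverges from this rule whenever such a def exists, which is the mismatch the comment describes.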
Return 0 immediately on any insert_stmt.execute() failure so the transaction drops and rolls back, ensuring all-or-nothing semantics. Previously, .is_ok() silently swallowed row-level errors, which could commit partial data and mis-trigger the JS fallback, causing duplicate rows.
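The all-or-nothing semantics can be illustrated with a pure TypeScript analog (the real mechanism is rusqlite's transaction being dropped without commit; this staged-array version and its names are hypothetical):

```typescript
// TS analog of the fix above: rows are staged and only "committed"
// (pushed into the table) if every insert succeeds. Any row-level
// failure aborts the whole batch, mirroring the Rust transaction
// rolling back on early return.
function insertAllOrNothing<T>(
  table: T[],
  rows: T[],
  insert: (row: T) => boolean, // false models a row-level execute() error
): number {
  const staged: T[] = [];
  for (const row of rows) {
    if (!insert(row)) return 0; // abort: nothing from this batch commits
    staged.push(row);
  }
  table.push(...staged); // commit only after every row succeeded
  return staged.length;
}
```

Returning 0 on failure also gives the JS caller an unambiguous signal that the batch did not land, so the fallback path cannot produce duplicates from a partially committed batch.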
Move CFG block/edge and dataflow edge inserts from JS iteration to Rust bulk operations, following the same pattern as bulk_insert_ast_nodes (6.9).

Rust side:
- cfg_db.rs: bulk_insert_cfg() resolves function node IDs, deletes stale data, inserts blocks+edges in a single rusqlite transaction
- dataflow_db.rs: bulk_insert_dataflow() pre-builds a node resolution cache (local-first, global fallback), inserts edges in a single transaction

JS side:
- cfg.ts: native fast path collects CfgFunctionBatch[] and delegates to Rust when all CFG is pre-computed by the native engine
- dataflow.ts: native fast path converts DataflowResult (argFlows, assignments, mutations) into FileDataflowBatch[] for Rust insertion
- Both fall back to the existing JS paths when the native addon is unavailable

Target: cfgMs + dataflowMs < 50ms combined (from ~286ms with JS iteration)
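The "local-first, global fallback" node resolution cache described above can be sketched roughly like this — a hedged TypeScript illustration (the actual implementation lives in Rust in dataflow_db.rs; the function name and Map-based shape here are assumptions):

```typescript
// Illustrative sketch: resolve each symbol to a node ID, preferring a
// definition in the current file (local) and falling back to a
// project-wide (global) definition. Unresolvable symbols are simply
// left out of the cache, so later edge inserts can skip them.
function buildResolutionCache(
  localDefs: Map<string, number>,  // symbol -> node id in current file
  globalDefs: Map<string, number>, // symbol -> node id project-wide
  symbols: string[],
): Map<string, number> {
  const cache = new Map<string, number>();
  for (const sym of symbols) {
    const id = localDefs.get(sym) ?? globalDefs.get(sym);
    if (id !== undefined) cache.set(sym, id); // unresolvable: stays absent
  }
  return cache;
}
```

Pre-building the cache once per file batch is what lets the insert loop run inside a single transaction without per-edge lookup queries.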
Greptile Summary

This PR extends the bulk-insert Rust optimization pattern (established in 6.9 for AST nodes) to the CFG and dataflow DB write phases, targeting a combined cfgMs + dataflowMs < 50ms.

Confidence Score: 3/5

Not safe to merge as-is: the dataflow bulk-insert path can create duplicate DB rows in normal usage, corrupting query results. Prior P1 feedback on fallback logic is well-addressed for both cfg.ts and dataflow.ts, and the CFG implementation is correct and idempotent. However, a new concrete P1 remains: dataflow_db.rs commits partial results without a preceding DELETE, and the JS fallback re-inserts the same resolved edges into an unconstrained table. This is not hypothetical — any project with calls to external/unresolvable functions will trigger it. A focused one-function fix (add a DELETE-before-insert in bulk_insert_dataflow, matching the build-stmts.ts pattern) would bring this to a 4 or 5. See crates/codegraph-core/src/dataflow_db.rs and src/features/dataflow.ts — the duplicate-on-fallback path.
Sequence Diagram

```mermaid
sequenceDiagram
    participant JS as buildDataflowEdges (TS)
    participant Rust as bulk_insert_dataflow (Rust)
    participant SQLite as SQLite DB
    JS->>JS: Collect file batches from symbols.dataflow
    JS->>Rust: bulkInsertDataflow(db.name, batches)
    Rust->>SQLite: Open connection (READ_WRITE)
    Rust->>SQLite: Phase 1 — resolve node IDs into cache
    Rust->>SQLite: BEGIN TRANSACTION
    loop For each edge
        alt Node resolved
            Rust->>SQLite: INSERT INTO dataflow (...)
        else Node unresolvable
            Rust->>Rust: continue (skip, no delete)
        end
    end
    Rust->>SQLite: COMMIT
    Rust-->>JS: inserted count
    alt inserted === expectedEdges OR expectedEdges === 0
        JS->>JS: return (success)
    else inserted < expectedEdges (unresolvable nodes)
        note over SQLite: Partial edges already committed
        JS->>SQLite: JS fallback — INSERT same resolvable edges again
        note over SQLite: Duplicates created, no UNIQUE constraint
    end
```
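The duplicate-on-fallback failure mode in the diagram can be reproduced with a small TypeScript simulation (the table is modeled as a plain array to mirror the unconstrained dataflow table; all names here are illustrative, not the actual source):

```typescript
// Simulation of the P1: the native path commits resolvable edges but
// skips unresolvable ones (no preceding DELETE, no rollback), then the
// count-mismatch check triggers the JS fallback, which re-inserts the
// same resolvable edges — nothing de-duplicates them.
interface Edge { from: string; to: string; resolvable: boolean }

function nativeBulkInsert(table: Edge[], edges: Edge[]): number {
  let inserted = 0;
  for (const e of edges) {
    if (!e.resolvable) continue; // skipped — but earlier rows stay committed
    table.push(e);
    inserted++;
  }
  return inserted;
}

function jsFallbackInsert(table: Edge[], edges: Edge[]): void {
  for (const e of edges) {
    if (e.resolvable) table.push(e); // no UNIQUE constraint stops this
  }
}

function buildDataflow(table: Edge[], edges: Edge[]): void {
  const inserted = nativeBulkInsert(table, edges);
  if (inserted === edges.length || edges.length === 0) return;
  // Mismatch: fall back — but the partial native insert is still there
  jsFallbackInsert(table, edges);
}
```

A DELETE-before-insert at the start of the native bulk operation (or a DELETE before the JS fallback) would make the operation idempotent and break this chain.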
Reviews (2): Last reviewed commit: "fix(db): add JS fallback on bulk-insert ..."
src/features/cfg.ts
```ts
if (!needsJsFallback) {
  const processed = native.bulkInsertCfg(db.name, batches);
  const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
  if (processed > 0) {
    info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
  }
  return;
}
```
Silent data loss on bulk-insert failure — no JS fallback
bulk_insert_cfg returning 0 indicates a Rust-side error (DB connection failure, missing table, failed transaction, etc.). The current code unconditionally returns after calling it, meaning any such failure silently drops all CFG writes without attempting the JS fallback path. This diverges from the established pattern in ast.ts, which explicitly checks the count and falls through on failure.
Compare ast.ts:

```ts
if (inserted === expectedNodes) {
  debug(`AST extraction (native bulk): ${inserted} nodes stored`);
  return;
}
debug(`AST extraction (native bulk): expected ${expectedNodes}, got ${inserted} — falling back to JS`);
// fall through to JS path
```

The CFG path should follow the same pattern — check the return value and only return on success:
```ts
if (!needsJsFallback) {
  const processed = native.bulkInsertCfg(db.name, batches);
  const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
  if (processed > 0 || withBlocks === 0) {
    if (processed > 0) info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
    return;
  }
  debug(`CFG: bulk insert returned 0 — falling back to JS`);
  // fall through
}
```
Fixed — cfg.ts now checks processed === batches.length and falls through to the JS path on mismatch, matching the ast.ts reference pattern. Also handles the empty-blocks edge case by checking expectedFunctions === 0.
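The fixed check described in this reply can be sketched as a standalone helper — a hedged TypeScript illustration (this `tryNativeBulkCfg` wrapper and its parameter shapes are hypothetical; the real code lives inline in cfg.ts):

```typescript
// Sketch of the corrected fast path: only return early when the native
// bulk insert processed every batch, or there was nothing to insert;
// any mismatch signals a Rust-side failure and the caller falls back
// to the JS insert loop instead of silently dropping writes.
interface CfgFunctionBatch { blocks: unknown[] }

function tryNativeBulkCfg(
  native: { bulkInsertCfg(dbName: string, batches: CfgFunctionBatch[]): number },
  dbName: string,
  batches: CfgFunctionBatch[],
): boolean {
  const processed = native.bulkInsertCfg(dbName, batches);
  const expectedFunctions = batches.length;
  if (processed === expectedFunctions || expectedFunctions === 0) {
    return true; // success — caller skips the JS path
  }
  return false; // mismatch — caller runs the JS fallback
}
```

Comparing against `batches.length` rather than `> 0` is what catches partial native failures, not just total ones.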
```ts
if (!needsJsFallback) {
  const inserted = native.bulkInsertDataflow(db.name, batches);
  info(`Dataflow: ${inserted} edges inserted (native bulk)`);
  return;
}
```
Silent data loss on bulk-insert failure — no JS fallback
Same issue as in cfg.ts: bulk_insert_dataflow returning 0 silently drops all dataflow writes without attempting the JS fallback. If all batches have edges but Rust returns 0, no edges are written and no fallback is attempted while the caller observes a clean return.
The ast.ts pattern protects against this by checking inserted === expectedNodes. The dataflow path should do the same:
```ts
if (!needsJsFallback) {
  const inserted = native.bulkInsertDataflow(db.name, batches);
  const expectedEdges = batches.reduce((s, b) => s + b.edges.length, 0);
  if (inserted === expectedEdges || expectedEdges === 0) {
    info(`Dataflow: ${inserted} edges inserted (native bulk)`);
    return;
  }
  debug(
    `Dataflow: bulk insert expected ${expectedEdges} edges, got ${inserted} — falling back to JS`,
  );
  // fall through to JS path
}
```
Fixed — dataflow.ts now checks inserted === expectedEdges and falls through to the JS path on mismatch, matching the ast.ts reference pattern.
Match the ast.ts reference pattern: check the Rust return value against the expected count and fall through to the JS path when they disagree, preventing silent data loss on Rust-side errors.
Addressed all Greptile feedback in 1a20f46:
All 11 CFG and 23 dataflow integration tests pass.
Summary
- bulk_insert_cfg() and bulk_insert_dataflow() Rust napi functions that perform CFG block/edge and dataflow edge inserts via rusqlite in single transactions, bypassing JS iteration loops
- buildCFGData() and buildDataflowEdges() fast paths that collect batches from pre-computed native extraction results and delegate to Rust bulk inserters

Details
Same optimization pattern as bulk_insert_ast_nodes (6.9). The DB write phases for CFG (161ms) and dataflow (125ms) were identical JS code on both engines — this moves them to Rust when the native engine is active.

Rust side:
- cfg_db.rs — resolves function node IDs, deletes stale data, inserts blocks+edges with block index→row ID mapping
- dataflow_db.rs — pre-builds a node resolution cache (local-first, global fallback), inserts flows_to/returns/mutates edges

Target: cfgMs + dataflowMs < 50ms combined on native full builds (current ~286ms)

Test plan